Cost Reduction for Web-Based Data Imputation
نویسندگان
چکیده
Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent filed value, existing method issues quite a few available imputation queries for each absent value, and then vote on deciding the most probable right answer. As a result, we have to issue a large number of imputation queries for filling all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune queries that probably will fail to return any answers a priori. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy.
منابع مشابه
Efficient Web-Based Data Imputation with Graph Model
A challenge for data imputation is the lack of knowledge. In this paper, we attempt to address this challenge by involving extra knowledge from web. To achieve high-performance web-based imputation, we use the dependency, i.e. FDs and CFDs, to impute as many as possible values automatically and fill in the other missing values with the minimal access of web, whose cost is relatively large. To m...
متن کاملAccuracy evaluation of different statistical and geostatistical censored data imputation approaches (Case study: Sari Gunay gold deposit)
Most of the geochemical datasets include missing data with different portions and this may cause a significant problem in geostatistical modeling or multivariate analysis of the data. Therefore, it is common to impute the missing data in most of geochemical studies. In this study, three approaches called half detection (HD), multiple imputation (MI), and the cosimulation based on Markov model 2...
متن کاملA New Hybrid model of Multi-layer Perceptron Artificial Neural Network and Genetic Algorithms in Web Design Management Based on CMS
The size and complexity of websites have grown significantly during recent years. In line with this growth, the need to maintain most of the resources has been intensified. Content Management Systems (CMSs) are software that was presented in accordance with increased demands of users. With the advent of Content Management Systems, factors such as: domains, predesigned module’s development, grap...
متن کاملAn Innovative Imputation and Classification Approach for Accurate Disease Prediction
Imputation of missing attribute values in medical datasets for extracting hidden knowledge from medical datasets is an interesting research topic of interest which is very challenging. One cannot eliminate missing values in medical records. The reason may be because some tests may not been conducted as they are cost effective, values missed when conducting clinical trials, values may not have b...
متن کاملWebPut: Efficient Web-Based Data Imputation
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable o...
متن کامل